 Tuyên Quang Province


Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges

Van Dinh, Nguyen, Dang, Thanh Chi, Nguyen, Luan Thanh, Van Nguyen, Kiet

arXiv.org Artificial Intelligence

Vietnamese, a low-resource language, is typically categorized into three primary dialect groups corresponding to Northern, Central, and Southern Vietnam. However, each province within these regions exhibits its own distinct pronunciation variations. Despite the existence of various speech recognition datasets, none provides a fine-grained classification of the 63 dialects specific to individual provinces of Vietnam. To address this gap, we introduce the Vietnamese Multi-Dialect (ViMD) dataset, a novel, comprehensive dataset capturing the rich diversity of the 63 provincial dialects spoken across Vietnam. Our dataset comprises 102.56 hours of audio in approximately 19,000 utterances, and the associated transcripts contain over 1.2 million words. To provide benchmarks and simultaneously demonstrate the challenges of our dataset, we fine-tune state-of-the-art pre-trained models for two downstream tasks: (1) dialect identification and (2) speech recognition. The empirical results point to two findings: geographical factors influence dialects, and current approaches to speech recognition are limited on multi-dialect speech data. Our dataset is available for research purposes.


Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese

Doan, Khang T., Huynh, Bao G., Hoang, Dung T., Pham, Thuc D., Pham, Nhat H., Nguyen, Quan T. M., Vo, Bang Q., Hoang, Suong N.

arXiv.org Artificial Intelligence

In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.


Learning Theory for Estimation of Animal Motion Submanifolds

Powell, Nathan, Kurdila, Andrew

arXiv.org Machine Learning

This paper describes the formulation and experimental testing of a novel method for the estimation and approximation of submanifold models of animal motion. It is assumed that the animal motion is supported on a configuration manifold $Q$ that is a smooth, connected, regularly embedded Riemannian submanifold of Euclidean space $X\approx \mathbb{R}^d$ for some $d>0$, and that the manifold $Q$ is homeomorphic to a known smooth Riemannian manifold $S$. Estimation of the manifold is achieved by finding an unknown mapping $\gamma:S\rightarrow Q\subset X$ that maps the manifold $S$ into $Q$. The overall problem is cast as a distribution-free learning problem over the manifold of measurements $\mathbb{Z}=S\times X$. That is, it is assumed that experiments generate a finite set $\{(s_i,x_i)\}_{i=1}^m\subset \mathbb{Z}^m$ of samples distributed according to an unknown probability density $\mu$ on $\mathbb{Z}$. This paper derives approximations $\gamma_{n,m}$ of $\gamma$ that are based on the $m$ samples and are contained in an $N(n)$-dimensional space of approximants. The paper gives sufficient conditions showing that the rates of convergence in $L^2_\mu(S)$ correspond to those known for classical distribution-free learning theory over Euclidean space. Specifically, the paper derives sufficient conditions that guarantee rates of convergence of the form $$\mathbb{E} \left (\|\gamma_\mu^j-\gamma_{n,m}^j\|_{L^2_\mu(S)}^2\right )\leq C_1 N(n)^{-r} + C_2 \frac{N(n)\log(N(n))}{m}$$ for constants $C_1,C_2$, with $\gamma_\mu:=\{\gamma^1_\mu,\ldots,\gamma^d_\mu\}$ the regressor function $\gamma_\mu:S\rightarrow Q\subset X$ and $\gamma_{n,m}:=\{\gamma^1_{n,m},\ldots,\gamma^d_{n,m}\}$.
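
The bound in this abstract trades an approximation term $C_1 N(n)^{-r}$, which shrinks as the dimension $N(n)$ grows, against a sampling term $C_2 N(n)\log(N(n))/m$, which grows with it. The following is a minimal sketch of that trade-off, not the authors' method: the function names `bound` and `best_dimension` are hypothetical, and the constants $C_1, C_2$ and the rate $r$ are not given in the abstract, so the values used here are placeholders chosen only for illustration.

```python
import math


def bound(N, m, r, C1=1.0, C2=1.0):
    """Right-hand side of the stated bound: C1 * N^(-r) + C2 * N * log(N) / m.

    C1 and C2 default to 1.0 as an assumption; the abstract does not give them.
    """
    return C1 * N ** (-r) + C2 * N * math.log(N) / m


def best_dimension(m, r, C1=1.0, C2=1.0):
    """Brute-force search for the dimension N in [2, m] minimizing the bound."""
    return min(range(2, m + 1), key=lambda N: bound(N, m, r, C1, C2))


if __name__ == "__main__":
    r = 2.0  # placeholder smoothness/approximation rate
    for m in (100, 1_000, 10_000):
        N = best_dimension(m, r)
        print(f"m={m:>6}  best N={N:>3}  bound={bound(N, m, r):.5f}")
```

Under these placeholder values, the minimizing dimension grows slowly with the number of samples $m$ while the bound itself decreases, which matches the usual heuristic of balancing the two terms, $N^{-r}\approx N\log(N)/m$.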